feat: Implement Observability 2.0 wide events #41
Merged
Extract actionable guidelines from modern logging philosophy and analyze delta with Keyboardia's current observability approach.
Key insights:
- Wide events (emit once per lifecycle) vs narrow events (current)
- Tail sampling strategy to reduce KV writes by 90%
- Correlation IDs for end-to-end tracing
- Structured JSON console output for wrangler tail
Includes 4-phase implementation roadmap with code examples.
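A minimal sketch of the tail sampling idea mentioned above, assuming a policy of "always keep errors and slow requests, keep roughly 10% of everything else". The event shape, the 1000 ms threshold, and the `shouldKeep` name are illustrative assumptions, not taken from the spec.

```typescript
// Hypothetical event shape; field names are assumptions for illustration.
interface SampledEvent {
  event: string;
  status: number;
  durationMs: number;
}

const SLOW_THRESHOLD_MS = 1000; // assumed cutoff for a "slow" request
const KEEP_RATE = 0.1; // keep about 10% of ordinary events

// Tail sampling: the decision is made after the unit of work has finished,
// so it can take the outcome and the duration into account.
function shouldKeep(ev: SampledEvent, rand: () => number = Math.random): boolean {
  if (ev.status >= 500) return true; // never drop server errors
  if (ev.durationMs >= SLOW_THRESHOLD_MS) return true; // never drop slow requests
  return rand() < KEEP_RATE; // sample the uninteresting remainder
}
```

The write reduction only approaches 90% when errors and slow requests are rare, since those are always kept. Passing `rand` as a parameter keeps the decision testable.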
Incorporate industry best practices from Cloudflare Workers docs and Charity Majors / Honeycomb philosophy:
- Add Observability 1.0 vs 2.0 framework (single source of truth)
- Add Cloudflare-native config (wrangler.toml, Workers Logs, head sampling)
- Recommend Workers Logs over KV storage (7-day retention, no quota impact)
- Add derived metrics principle (compute from events, don't collect separately)
- Add query-first mindset for unknown unknowns
- Add automatic instrumentation section (invocation_logs, traces)
- Expand sampling to cover head + tail strategy
- Add SQL query examples for Workers Logs Query Builder
- Update sources with Cloudflare and Honeycomb references
Note: OpenTelemetry export intentionally excluded per requirements.
Organizational changes reflecting the generational shift:
- Rename LOGGING-GUIDELINES.md → OBSERVABILITY-2-0.md
- Move OBSERVABILITY.md → specs/archive/ (Phase 7 implementation)
- Add archive notice to legacy doc
Content additions to Observability 2.0:
- Add HTTP wide events for session lifecycle (create, access, publish, remix)
- Add retention strategy for 30-day admin metrics via KV daily rollups
- Add scheduled worker pattern for aggregating Workers Logs → KV
- Define query strategy by time range (Workers Logs vs KV)
This enables admin dashboards to answer:
- Sessions created/published/remixed over 1, 7, 30 days
- Remix rates and most-remixed sessions
- Long-term trends beyond Workers Logs 7-day retention
Changes:
- Remove section 2.6 (KV-based 30-day retention strategy). Accept Workers Logs 7-day limit as the retention boundary
- Add Appendix D: Complete Wide Events Catalog. Comprehensive event definitions from codebase audit:
  - Session Lifecycle (5 events): session_created, session_accessed, session_updated, session_published, session_remixed
  - WebSocket Events (3 events): ws_session_end (primary wide event with full context), ws_player_joined, ws_player_left
  - Error Events (6 events): error_rate_limit, error_quota_exceeded, error_validation, error_ws_connection, error_invariant, error_mutation_rejected
  - Sync Events (3 events): sync_hash_mismatch, sync_snapshot_sent, sync_client_behind
  - Playback Events (2 events): playback_started, playback_stopped
- Add volume estimates (~3,000-25,000 events/day)
- Add "What we intentionally don't track" section
- Add query recipes for common analytics questions
Total: 19 wide event types covering complete system observability
Replace vague ranges with explicit calculations based on:
- 500 DAU baseline assumption
- Per-user behavior rates (sessions created, accesses, multiplayer adoption)
- Derived event volumes with clear math
Add scaling projection table showing linear growth from 50 to 500K DAU. Even at 500K DAU, event volume is only 0.2% of Workers Logs limit.
Convert from 1,177-line implementation spec to 155-line research document.
Changes:
- Remove implementation phases, code examples, event schemas
- Remove appendices (wide event catalog, sampling tree, query recipes)
- Keep core philosophy (Obs 1.0 vs 2.0, wide events pattern)
- Keep sources (Charity Majors, loggingsucks.com, Cloudflare)
- Keep applicability analysis and trade-offs
- Add clear recommendation: "No immediate action required"
Restore OBSERVABILITY.md as the active implementation doc. OBSERVABILITY-2-0.md is now reference material for future decisions.
…arch
- Update event volume estimates with 1,000 DAU as primary baseline
- Add 30 DAU early launch and growth scenarios for context
- Add cost analysis section showing Workers Logs pricing
- Document that Obs 2.0 would slightly reduce costs while improving retention
Defines three wide events:
- http_request_end: Full HTTP request lifecycle
- ws_session_end: WebSocket connection lifecycle with message stats
- error: Structured error tracking
Includes TypeScript schemas, examples, implementation patterns, migration path, and effort estimates (~13 hours total).
- Rename OBSERVABILITY-2-0-IMPL.md to OBSERVABILITY-2-0-IMPLEMENTATION.md
- Move OBSERVABILITY-2-0.md to specs/research/ (research docs folder)
- Update cross-references between documents
- Change config examples from wrangler.toml to wrangler.jsonc
- Replace userId with playerId (matches Keyboardia terminology)
- Add playerId to http_request_end example
- Fix typo: oderId → playerId in ws_session_end schema
- Add "sessions created per unique user" to queryable questions
- Add "Designing Wide Events" section with 6 principles
- Include litmus test for event width
- Update config examples from wrangler.toml to wrangler.jsonc
Key changes:
- Add isCreator boolean to ws_session_end (most users are joiners)
- Add action: "access" for session joins (vs "create")
- Add Design Decisions section with included/excluded tables
- Add Typical Traces showing creator vs joiner flows
- Add Architecture Sequence Diagram showing all layers
- Update examples to show joiner perspective (majority case)
New field enables answering:
- Are people mostly consuming published content?
- Do people spend more time on published vs editable sessions?
- Which published sessions get the most views/attention?
Updated both http_request_end and ws_session_end schemas, examples, queryable questions, and design decision tables.
Add insights from Charity Majors, Stripe's Canonical Log Lines:
- "High cardinality is the feature, not the bug"
- "Never aggregate at write-time" principle
- The "Stuff the Blob" pattern from Stripe
- Anti-patterns to avoid (low-dimensionality, PII, grep-oriented logs)
Document how engagement signals differ between consumption modes:
- Published: passive (play/stop), solo viewing, shallow but broad
- Editable: active (toggle_step, etc.), collaboration, deep but narrow
Key additions to http_request_end:
- sourceSessionId: Track which session was remixed/published FROM
- deviceType: mobile vs desktop segmentation (derived from User-Agent)
New examples showing:
- Remix action with sourceSessionId for virality tracking
- Publish action capturing the publishing flow
New queryable questions:
- "Which sessions generated the most remixes?"
- "Are mobile users more likely to consume or create?"
Documents isCreator derivation (Option 4): correlate playerId from action="create" with ws_session_end events for same sessionId.
- isCreator now determined by comparing CF-Connecting-IP + User-Agent hash with stored creatorIdentity from session creation
- More reliable than ephemeral playerId (server-generated per connection)
- Added CreatorIdentity interface and hashUserAgent helper
- Documented limitations (VPN changes, different browsers)
- Updated trace diagrams to show IP-based creator identification
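The comparison described above can be sketched as follows. `CreatorIdentity` and `hashUserAgent` are named in the commit, but their fields and bodies here are assumptions; in particular the FNV-1a hash is a stand-in chosen to keep the example synchronous, and the real helper may use a different algorithm.

```typescript
// Field names are assumptions; the commit only names the interface.
interface CreatorIdentity {
  ip: string;
  userAgentHash: string;
}

// Stand-in hash (32-bit FNV-1a). The actual hashUserAgent may differ.
function hashUserAgent(userAgent: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userAgent.length; i++) {
    hash ^= userAgent.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Accepts anything with a Headers-like get(), so it works with request.headers.
function identityFromHeaders(headers: { get(name: string): string | null }): CreatorIdentity {
  return {
    ip: headers.get("CF-Connecting-IP") ?? "unknown",
    userAgentHash: hashUserAgent(headers.get("User-Agent") ?? ""),
  };
}

// A connection counts as the creator only when both IP and UA hash match.
function isCreator(stored: CreatorIdentity | undefined, current: CreatorIdentity): boolean {
  if (!stored) return false;
  return stored.ip === current.ip && stored.userAgentHash === current.userAgentHash;
}
```

As the commit notes, this heuristic misclassifies a creator who reconnects from a different network or browser.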
- Fix async/await in code examples (handleSessionCreate, handleWebSocketConnect)
- Fix Map<string, WsContext> to Map<WebSocket, WsContext>
- Clarify IP address exclusion (used server-side for isCreator, not logged)
- Add cross-reference to implementation spec from OBSERVABILITY.md
- Align event volume estimates: ~11,500/day at 1K DAU (was ~16,000 in research)
- Align effort estimates: ~13 hours (was ~15 in research)
- Add note that HTTP middleware example is simplified
- Add cross-reference from research doc to implementation spec
Explains why the event is guaranteed (Cloudflare ping/pong) and timing implications for clean vs dirty disconnects.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 1: POST /api/errors endpoint
- Phase 2: sendBeacon on pagehide
- Phase 3: WebSocket client_error message type
Includes transport flow diagram, client-specific schema fields, and error type classification table.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add warnings?: Warning[] to HttpRequestEndEvent and WsSessionEndEvent
- Define Warning type with recoveryAction discriminator
- Document warning types (KVReadRetry, SlowDO, StateRepair, etc.)
- Add collection mechanism: explicit parameter for HTTP, instance Map for WS
- Max 10 warnings per event to prevent unbounded growth
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
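A rough sketch of how a bounded warnings list might look. The warning type names come from the commit; the `recoveryAction` values, the extra fields, and the `addWarning` helper are hypothetical.

```typescript
// Discriminated on recoveryAction, as the commit describes.
// The literal values and payload fields are illustrative assumptions.
type Warning =
  | { type: "KVReadRetry"; recoveryAction: "retry_succeeded"; attempts: number }
  | { type: "SlowDO"; recoveryAction: "continued"; durationMs: number }
  | { type: "StateRepair"; recoveryAction: "auto_repaired"; detail: string };

const MAX_WARNINGS = 10; // cap from the commit message

// Appends a warning unless the cap is reached; returns whether it was recorded.
function addWarning(warnings: Warning[], warning: Warning): boolean {
  if (warnings.length >= MAX_WARNINGS) return false;
  warnings.push(warning);
  return true;
}
```

For HTTP handlers the array would be passed explicitly through the call chain; for WebSockets the commit describes an instance-level Map, which could hold one such array per connection.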
…vents
- Add deploy object (versionId, versionTag, deployedAt) from CF_VERSION_METADATA
- Add infra object (colo, country) from request.cf
- Add service object (name, environment) for identity
- Document wrangler.jsonc configuration for version_metadata binding
- Include code examples for accessing CF metadata in handlers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
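One way the three metadata objects could be assembled, as a sketch. The version metadata binding exposes `id`, `tag`, and `timestamp`, and `request.cf` carries `colo` and `country`; mapping `timestamp` to `deployedAt` is an assumption, and the `SERVICE_NAME` and `ENVIRONMENT` variable names are hypothetical. Local interfaces stand in for the Workers types so the example runs outside a Worker.

```typescript
// Minimal local stand-ins for the Workers runtime types.
interface VersionMetadata {
  id: string;
  tag: string;
  timestamp: string;
}

interface Env {
  CF_VERSION_METADATA: VersionMetadata;
  SERVICE_NAME?: string; // hypothetical env var name
  ENVIRONMENT?: string; // hypothetical env var name
}

interface CfProperties {
  colo?: string;
  country?: string;
}

// Builds the shared deploy/infra/service block attached to every event.
function buildMetadata(env: Env, cf?: CfProperties) {
  return {
    deploy: {
      versionId: env.CF_VERSION_METADATA.id,
      versionTag: env.CF_VERSION_METADATA.tag,
      deployedAt: env.CF_VERSION_METADATA.timestamp, // assumed mapping
    },
    infra: {
      colo: cf?.colo ?? "unknown", // request.cf can be absent in local dev
      country: cf?.country ?? "unknown",
    },
    service: {
      name: env.SERVICE_NAME ?? "keyboardia",
      environment: env.ENVIRONMENT ?? "unknown",
    },
  };
}
```

In a handler this might be called as `buildMetadata(env, request.cf)` and spread into the event before emission.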
- Add slug field for machine-readable error classification
- Add expected boolean to distinguish anticipated vs unexpected errors
- Add deploy/infra/service metadata for consistency with other events
- Update "What to report" table with expected values and example slugs
- Add slug naming convention guidance
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add deploy/infra/service to all three event tables
- Add slug and expected to error event table
- Clarify geo exclusion (detailed geo excluded, colo+country sufficient)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…vents
Major restructuring per Boris Tane / Honeycomb guidance:
- Rename http_request_end → http_request, ws_session_end → ws_session
- Remove separate `error` event type (violates wide event philosophy)
- Add `outcome: "ok" | "error"` field to both events (Boris Tane pattern)
- Add `error` object with type, message, slug, expected, handler, stack
- Keep `client_error` as exception (no parent server-side unit of work)
- Update Design Decisions tables
- Update Event Volume estimates
- Add error example for http_request
Key insight: "ONE event per unit of work" - errors should be embedded in the parent event for full context correlation, not separate events.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
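To illustrate "one event per unit of work", here is a hedged sketch of an `http_request` event that carries its own error. The `outcome` field and the `error` object's keys follow the commit; the remaining fields and both helper names are assumptions made for the example.

```typescript
// Keys follow the commit message; optionality of handler/stack is assumed.
interface EventError {
  type: string;
  message: string;
  slug: string;
  expected: boolean;
  handler?: string;
  stack?: string;
}

// Only a small, illustrative subset of a real wide event.
interface HttpRequestEvent {
  event: "http_request";
  method: string;
  path: string;
  status: number;
  durationMs: number;
  outcome: "ok" | "error";
  error?: EventError;
}

type RequestContext = Pick<HttpRequestEvent, "method" | "path" | "status" | "durationMs">;

// The error, if any, is embedded in the parent event instead of emitted separately.
function buildHttpRequestEvent(ctx: RequestContext, error?: EventError): HttpRequestEvent {
  const ev: HttpRequestEvent = { event: "http_request", ...ctx, outcome: error ? "error" : "ok" };
  if (error) ev.error = error;
  return ev;
}

// Emits one JSON line to the console and returns the serialized line.
function emit(ev: HttpRequestEvent): string {
  const line = JSON.stringify(ev);
  console.log(line);
  return line;
}
```

Because the error travels with the request context, a single query can filter on `outcome = "error"` and still see method, path, and timing without joining a second event stream.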
Replace per-action KV logging with lifecycle-based wide events emitted to Cloudflare Workers Logs via console.log(JSON.stringify(...)).
Two event types implemented:
- http_request: One per HTTP request with embedded errors
- ws_session: One per WebSocket connection, emitted at disconnect
Key changes:
- Add observability.ts with event schemas, helpers, and emission
- Add route-patterns.ts for route pattern matching
- Update index.ts to emit http_request events instead of KV logs
- Update live-session.ts with PlayerObservability tracking and ws_session emission
- Clean up logging.ts to keep only state hashing utilities
- Configure wrangler.jsonc with observability, version_metadata, and env vars
- Update spec to defer client_error (violates wide event principle)
Events include deployment metadata (version ID, tag), infrastructure info (colo, country), and service identity. Errors are embedded with type, message, slug, expected flag, and optional stack trace.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
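Route pattern matching usually exists so that events group by route shape and not by individual session ID. The sketch below shows that idea only; apart from `/api/health`, the paths and the `matchRoutePattern` name are hypothetical and may not match what route-patterns.ts actually contains.

```typescript
// Ordered list of patterns; session routes here are illustrative assumptions.
const ROUTE_PATTERNS: ReadonlyArray<{ pattern: string; regex: RegExp }> = [
  { pattern: "/api/health", regex: /^\/api\/health\/?$/ },
  { pattern: "/api/sessions", regex: /^\/api\/sessions\/?$/ },
  { pattern: "/api/sessions/:id", regex: /^\/api\/sessions\/[^/]+\/?$/ },
  { pattern: "/api/sessions/:id/remix", regex: /^\/api\/sessions\/[^/]+\/remix\/?$/ },
];

// Collapses a concrete path into a low-cardinality pattern for grouping.
function matchRoutePattern(path: string): string {
  for (const route of ROUTE_PATTERNS) {
    if (route.regex.test(path)) return route.pattern;
  }
  return "unknown";
}
```

Logging both the raw path and the matched pattern keeps the high-cardinality detail while still allowing per-route aggregation.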
Complete the three previously deferred items in Observability 2.0:
1. isCreator Detection (IP + User-Agent hash):
   - Add CreatorIdentity interface and helper functions
   - Store creator identity on first WebSocket connect
   - Persist to DO storage for hibernation survival
   - Compare on subsequent connections to detect creator
   - Include in ws_session event emission
2. syncRequestCount and syncErrorCount:
   - Track client-requested snapshot recovery
   - Track proactive sync when client falls behind (ACK gap)
   - Track sync errors when state is unavailable
   - Include counters in ws_session event
3. responseSize in HTTP events:
   - Add responseSize option to emitEvent helper
   - Calculate size using TextEncoder for key endpoints
   - Include in events for: session create, GET, remix, publish
All items verified by audit against spec.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
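The responseSize calculation mentioned in item 3 can be sketched with the standard `TextEncoder`. The `responseSizeOf` name is hypothetical; only the use of TextEncoder comes from the commit.

```typescript
// Size in bytes of a response body once UTF-8 encoded.
// String.length counts UTF-16 code units, so it undercounts non-ASCII text.
function responseSizeOf(body: string): number {
  return new TextEncoder().encode(body).length;
}

// Typical use: serialize once, measure, then send the same string.
const exampleBody = JSON.stringify({ a: 1 });
const exampleSize = responseSizeOf(exampleBody);
```

Serializing the payload a single time and reusing the string for both measurement and the response avoids paying the JSON.stringify cost twice.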
Add test:e2e:full-stack script that runs E2E tests against the complete Cloudflare Worker stack (wrangler dev) instead of just the Vite dev server. This enables testing of Durable Objects, KV storage, and Worker API endpoints.
Changes:
- Add scripts/test-e2e-full-stack.ts: builds project, starts wrangler dev, runs Playwright tests, and cleans up
- Add /api/health endpoint for test runner health checks
- Update playwright.config.ts to support PLAYWRIGHT_BASE_URL env var
- Fix hardcoded URL in pitch-contour-alignment.spec.ts to use API_BASE
- Add npm scripts: test:e2e:full-stack and test:e2e:full-stack:smoke
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Implement lifecycle-based wide events per the Observability 2.0 pattern, replacing per-action KV logging with structured events emitted to Cloudflare Workers Logs.
Two event types
- `http_request`: One per HTTP request with embedded errors, timing, and context
- `ws_session`: One per WebSocket connection, emitted at disconnect with full session stats

Key features
New tooling
- `npm run test:e2e:full-stack`: Run E2E tests against wrangler dev (full Cloudflare stack)
- `/api/health` endpoint for monitoring and test runners

Files changed
- `observability.ts`, `route-patterns.ts` (new)
- `index.ts`, `live-session.ts`, `logging.ts`
- `wrangler.jsonc`, `types.ts`
- `playwright.config.ts`, `test-e2e-full-stack.ts` (new)

Test plan
🤖 Generated with Claude Code